domain-specific data
How Well Do LLMs Predict Human Behavior? A Measure of their Pretrained Knowledge
Gao, Wayne, Han, Sukjin, Liang, Annie
Large language models (LLMs) are increasingly used in economics as predictive tools--both to generate synthetic responses in place of human subjects (Horton, 2023; Anthis et al., 2025), and to forecast economic outcomes directly (Hewitt et al., 2024a; Faria-e Castro and Leibovici, 2024; Chan-Lau et al., 2025). Their appeal in these roles is obvious: A pretrained LLM embeds a vast amount of information and can be deployed at negligible cost, often in settings where collecting new, domain-specific human data would be expensive or infeasible. What remains unclear is how to assess the quality of these predictions. This paper proposes a measure that quantifies the domain-specific value of LLMs in an interpretable unit: the amount of human data they substitute for. Specifically, we ask how much human data would be required for a conventional model trained on that data to match the predictive performance of the pretrained LLM in that domain.
Evaluating Differentially Private Generation of Domain-Specific Text
Sun, Yidan, Schlegel, Viktor, Nandakumar, Srinivasan, Zahid, Iqra, Wu, Yuping, Del-Pinto, Warren, Nenadic, Goran, Lam, Siew-Kei, Zhang, Jie, Bharath, Anil A
Generative AI offers transformative potential for high-stakes domains such as healthcare and finance, yet privacy and regulatory barriers hinder the use of real-world data. To address this, differentially private synthetic data generation has emerged as a promising alternative. In this work, we introduce a unified benchmark to systematically evaluate the utility and fidelity of text datasets generated under formal Differential Privacy (DP) guarantees. Our benchmark addresses key challenges in domain-specific benchmarking, including choice of representative data and realistic privacy budgets, accounting for pre-training and a variety of evaluation metrics. We assess state-of-the-art privacy-preserving generation methods across five domain-specific datasets, revealing significant utility and fidelity degradation compared to real data, especially under strict privacy constraints. These findings underscore the limitations of current approaches, outline the need for advanced privacy-preserving data sharing methods and set a precedent regarding their evaluation in realistic scenarios.
Improved Supervised Fine-Tuning for Large Language Models to Mitigate Catastrophic Forgetting
Supervised Fine-Tuning (SFT) is a critical step for enhancing the instruction-following capabilities of Large Language Models (LLMs) and adapting them to specialized domains. However, SFT often leads to a degradation of the model's general abilities, a phenomenon known as catastrophic forgetting. This problem is exacerbated when third-party practitioners fine-tune open-source models, as the original SFT data is typically not available. To address this challenge, we propose a novel and cost-effective SFT method that effectively mitigates catastrophic forgetting without requiring access to the original SFT data. Our approach first reconstructs the likely instruction distribution of the base model. It then employs a multi-model generation and filtering pipeline to synthesize a high-quality general-purpose dataset. This synthetic dataset is mixed with new, domain-specific data for fine-tuning. Experimental results show that our method not only preserves the model's capabilities in general domains but also improves task-specific performance, outperforming baselines that use publicly available SFT datasets.
TrustDataFilter:Leveraging Trusted Knowledge Base Data for More Effective Filtering of Unknown Information
Zhang, Jinghong, Cui, Yidong, Wang, Weiling, Cheng, Xianyou
With the advancement of technology and changes in the market, the demand for the construction of domain-specific knowledge bases has been increasing, either to improve model performance or to promote enterprise innovation and competitiveness. The construction of domain-specific knowledge bases typically relies on web crawlers or existing industry databases, leading to problems with accuracy and consistency of the data. To address these challenges, we considered the characteristics of domain data, where internal knowledge is interconnected, and proposed the Self-Natural Language Inference Data Filtering (self-nli-TDF) framework. This framework compares trusted filtered knowledge with the data to be filtered, deducing the reasoning relationship between them, thus improving filtering performance. The framework uses plug-and-play large language models for trustworthiness assessment and employs the RoBERTa-MNLI model from the NLI domain for reasoning. We constructed three datasets in the domains of biology, radiation, and science, and conducted experiments using RoBERTa, GPT3.5, and the local Qwen2 model. The experimental results show that this framework improves filter quality, producing more consistent and reliable filtering results.
Matching domain experts by training from scratch on domain knowledge
Luo, Xiaoliang, Sun, Guangzhi, Love, Bradley C.
Recently, large language models (LLMs) have outperformed human experts in predicting the results of neuroscience experiments (Luo et al., 2024). What is the basis for this performance? One possibility is that statistical patterns in that specific scientific literature, as opposed to emergent reasoning abilities arising from broader training, underlie LLMs' performance. To evaluate this possibility, we trained (next word prediction) a relatively small 124M-parameter GPT-2 model on 1.3 billion tokens of domain-specific knowledge. Despite being orders of magnitude smaller than larger LLMs trained on trillions of tokens, small models achieved expert-level performance in predicting neuroscience results. Small models trained on the neuroscience literature succeeded when they were trained from scratch using a tokenizer specifically trained on neuroscience text or when the neuroscience literature was used to finetune a pretrained GPT-2. Our results indicate that expert-level performance may be attained by even small LLMs through domain-specific, auto-regressive training approaches.
EcomGPT-CT: Continual Pre-training of E-commerce Large Language Models with Semi-structured Data
Ma, Shirong, Huang, Shen, Huang, Shulin, Wang, Xiaobin, Li, Yangning, Zheng, Hai-Tao, Xie, Pengjun, Huang, Fei, Jiang, Yong
Large Language Models (LLMs) pre-trained on massive corpora have exhibited remarkable performance on various NLP tasks. However, applying these models to specific domains still poses significant challenges, such as lack of domain knowledge, limited capacity to leverage domain knowledge and inadequate adaptation to domain-specific data formats. Considering the exorbitant cost of training LLMs from scratch and the scarcity of annotated data within particular domains, in this work, we focus on domain-specific continual pre-training of LLMs using E-commerce domain as an exemplar. Specifically, we explore the impact of continual pre-training on LLMs employing unlabeled general and E-commercial corpora. Furthermore, we design a mixing strategy among different data sources to better leverage E-commercial semi-structured data. We construct multiple tasks to assess LLMs' few-shot In-context Learning ability and their zero-shot performance after instruction tuning in E-commerce domain. Experimental results demonstrate the effectiveness of continual pre-training of E-commerce LLMs and the efficacy of our devised data mixing strategy.
Sumil Khosla on LinkedIn: FinancialAdvisor.AI
Is it worth training your own large language model (LLM) on domain-specific data from scratch? Researchers at Bloomberg did just that and shared a detailed technical report describing the dataset, model configuration, and training procedure. The core question is, is it worth training the LLM from scratch? In my experience, it makes total sense if we want to apply LLMs to novel data sources (e.g., protein amino acid sequences as ProtBERT demonstrated). BloombergGPT is a 50-billion parameter language model for finance, trained on 363 billion tokens from finance data and 345 billion tokens from a general dataset.
Context Matters: A Strategy to Pre-train Language Model for Science Education
Liu, Zhengliang, He, Xinyu, Liu, Lei, Liu, Tianming, Zhai, Xiaoming
This study aims at improving the performance of scoring student responses in science education automatically. BERT-based language models have shown significant superiority over traditional NLP models in various language-related tasks. However, science writing of students, including argumentation and explanation, is domain-specific. In addition, the language used by students is different from the language in journals and Wikipedia, which are training sources of BERT and its existing variants. All these suggest that a domain-specific model pre-trained using science education data may improve model performance. However, the ideal type of data to contextualize pre-trained language model and improve the performance in automatically scoring student written responses remains unclear. Therefore, we employ different data in this study to contextualize both BERT and SciBERT models and compare their performance on automatic scoring of assessment tasks for scientific argumentation. We use three datasets to pre-train the model: 1) journal articles in science education, 2) a large dataset of students' written responses (sample size over 50,000), and 3) a small dataset of students' written responses of scientific argumentation tasks. Our experimental results show that in-domain training corpora constructed from science questions and responses improve language model performance on a wide variety of downstream tasks. Our study confirms the effectiveness of continual pre-training on domain-specific data in the education domain and demonstrates a generalizable strategy for automating science education tasks with high accuracy. We plan to release our data and SciEdBERT models for public use and community engagement.
Prompt Tuning GPT-2 language model for parameter-efficient domain adaptation of ASR systems
Dingliwal, Saket, Shenoy, Ashish, Bodapati, Sravan, Gandhe, Ankur, Gadde, Ravi Teja, Kirchhoff, Katrin
Automatic Speech Recognition (ASR) systems have found their use in numerous industrial applications in very diverse domains creating a need to adapt to new domains with small memory and deployment overhead. In this work, we introduce domain-prompts, a methodology that involves training a small number of domain embedding parameters to prime a Transformer-based Language Model (LM) to a particular domain. Using this domain-adapted LM for rescoring ASR hypotheses can achieve 7-13% WER reduction for a new domain with just 1000 unlabeled textual domain-specific sentences. This improvement is comparable or even better than fully fine-tuned models even though just 0.02% of the parameters of the base LM are updated. Additionally, our method is deployment-friendly as the learnt domain embeddings are prefixed to the input to the model rather than changing the base model architecture. Therefore, our method is an ideal choice for on-the-fly adaptation of LMs used in ASR systems to progressively scale it to new domains.
10 NLP Predictions for 2022
Natural language processing (NLP) has been one of the hottest sectors in AI over the past two years. Will the string of big data breakthroughs continue into 2022? We checked in with industry experts to find out. There's been a veritable arms race to develop large transformer models over the past couple of years. It started in 2020 with OpenAI's GPT-3 with 175 billion parameters.